Search CORE

6 research outputs found

Efficient and Robust Methods for Audio and Video Signal Analysis

Author: Mahkonen Katariina
Publication venue: Tampere University of Technology
Publication date: 01/01/2018

This thesis presents my research on audio and video signal processing and machine learning. Specifically, the topics of my research include computationally efficient classifier compounds, automatic speech recognition (ASR), music dereverberation, video cut point detection and video classification. The thesis considers the computational efficiency of information retrieval based on multiple measurement modalities. Specifically, a cascade processing framework, including a training algorithm to set its parameters, has been developed for combining multiple detectors or binary classifiers in a computationally efficient way. The developed cascade processing framework has been applied to the video information retrieval tasks of video cut point detection and video classification. The results in video classification, compared to others found in the literature, indicate that the developed framework is capable of both accurate and computationally efficient classification. The idea of cascade processing has additionally been adapted for the ASR task. A procedure for combining multiple speech state likelihood estimation methods within an ASR framework in a cascaded manner has been developed. The results obtained clearly show that the computational load of ASR can be reduced using the cascaded speech state likelihood estimation process without impairing the transcription accuracy. Additionally, this thesis presents my work on the noise robustness of ASR using a nonnegative matrix factorization (NMF)-based approach. Specifically, methods for transforming sparse NMF features into speech state likelihoods have been explored. The results reveal that learned transformations from NMF activations to speech state likelihoods provide better ASR transcription accuracy than dictionary label-based transformations. The results, compared to others in a noisy speech recognition challenge, show that NMF-based processing is an efficient strategy for noise robustness in ASR. The thesis also presents my work on audio signal enhancement, specifically on removing the detrimental effect of reverberation from music audio. In this work, a linear prediction-based dereverberation algorithm, originally developed for speech signal enhancement, was applied to music. The results obtained show that the algorithm performs well on music signals and indicate that dynamic compression of music does not impair the dereverberation performance.

Trepo - Institutional Repository of Tampere University

Efficient and Robust Methods for Audio and Video Signal Analysis

Author: Mahkonen Katariina
Publication venue: Tampere University of Technology
Publication date: 01/01/2018

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

Cascade of Boolean detector combinations

Author: Kämäräinen Joni
Mahkonen Katariina
Virtanen Tuomas
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/07/2018

This paper considers a scenario in which we have multiple pre-trained detectors for detecting an event and a small dataset for training a combined detection system. We build the combined detector as a Boolean function of thresholded detector scores and implement it as a binary classification cascade. The cascade structure is computationally efficient because it allows early termination. For the proposed Boolean combination function, the computational load of classification is reduced whenever the function becomes determinate before all the component detectors have been utilized. We also propose an algorithm that selects all the needed thresholds for the component detectors within the proposed Boolean combination. We present results on two audio-visual datasets, which demonstrate the efficiency of the proposed combination framework. We achieve state-of-the-art accuracy with substantially reduced computation time in a laughter detection task, and our algorithm finds better thresholds for the component detectors within the Boolean combination than the other algorithms found in the literature. Published version. Peer reviewed.
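
The early-termination idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the OR-of-ANDs form of the Boolean function, the name `cascade_decision`, the toy detectors and the threshold values are all assumptions made for illustration, and the paper's threshold selection algorithm is not reproduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Detector = Callable[[dict], float]
# A literal is (detector index, threshold); it holds when score >= threshold.
Literal = Tuple[int, float]


def cascade_decision(
    sample: dict,
    detectors: Sequence[Detector],
    dnf: List[List[Literal]],
) -> Tuple[bool, int]:
    """Evaluate an OR of AND terms over thresholded scores, lazily.

    Returns (decision, number of detectors actually run).
    The OR-of-ANDs form is an assumed example of a Boolean combination.
    """
    scores: Dict[int, float] = {}  # cache: each detector runs at most once

    def score(i: int) -> float:
        if i not in scores:
            scores[i] = detectors[i](sample)
        return scores[i]

    for conjunction in dnf:
        # all() short-circuits, so a failed literal skips the rest of the term
        if all(score(i) >= thr for i, thr in conjunction):
            return True, len(scores)  # the OR is determinate: stop early
    return False, len(scores)


# Hypothetical detectors: index 0 stands for a cheap one, index 1 for a costly one.
detectors = [lambda s: s["cheap"], lambda s: s["costly"]]
# Accept if cheap >= 0.9, OR (cheap >= 0.5 AND costly >= 0.7).
dnf = [[(0, 0.9)], [(0, 0.5), (1, 0.7)]]

print(cascade_decision({"cheap": 0.95, "costly": 0.1}, detectors, dnf))
print(cascade_decision({"cheap": 0.20, "costly": 0.99}, detectors, dnf))
print(cascade_decision({"cheap": 0.60, "costly": 0.80}, detectors, dnf))
```

In the first two samples the cheap score alone makes the function determinate, so the costly detector is never run. Only the third sample needs both scores.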

Directory of Open Access Journals

Trepo - Institutional Repository of Tampere University

Exemplar-based recognition of speech in highly variable noise

Author: Gemmeke Jort
Hurmalainen Antti
Mahkonen Katariina
Virtanen Tuomas
Publication venue
Publication date: 01/01/2011

Hurmalainen A., Mahkonen K., Gemmeke J.F., Virtanen T., "Exemplar-based recognition of speech in highly variable noise", Proceedings of the 1st International Workshop on Machine Listening in Multisource Environments - CHiME 2011 (satellite event of Interspeech 2011), 6 pp., September 1, 2011, Florence, Italy. Status: published

Lirias

Mapping Sparse Representation to State Likelihoods in Noise-Robust Automatic Speech Recognition

Author: Gemmeke Jort
Hurmalainen Antti
Mahkonen Katariina
Virtanen Tuomas
Publication venue: 'The International Fiscal Association of Korea'
Publication date: 01/01/2011

Status: published

Lirias

Cascade of Boolean detector combinations

Author: Joni Kämäräinen
Katariina Mahkonen
Tuomas Virtanen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/07/2018

This paper considers a scenario in which we have multiple pre-trained detectors for detecting an event and a small dataset for training a combined detection system. We build the combined detector as a Boolean function of thresholded detector scores and implement it as a binary classification cascade. The cascade structure is computationally efficient because it allows early termination. For the proposed Boolean combination function, the computational load of classification is reduced whenever the function becomes determinate before all the component detectors have been utilized. We also propose an algorithm that selects all the needed thresholds for the component detectors within the proposed Boolean combination. We present results on two audio-visual datasets, which demonstrate the efficiency of the proposed combination framework. We achieve state-of-the-art accuracy with substantially reduced computation time in a laughter detection task, and our algorithm finds better thresholds for the component detectors within the Boolean combination than the other algorithms found in the literature.

Crossref

Directory of Open Access Journals

Trepo - Institutional Repository of Tampere University